
Fix DeployBot recovery deadlock from stale re-pause #48

Merged: cursor[bot] merged 1 commit into main from cursor/fix-recovery-deadlock-stale-repause-fc6d on Jun 22, 2026


@mberman84


Problem

A failed exact-main release that an operator explicitly unpaused could never land its repair, deadlocking delivery.

Reproduction (from the report):

  • CI failed for main 79cbedf6….
  • Repair PR #970 was queued and fully green at 86152d68….
  • deploybot unpause created a running recovery control.
  • The next DeployBot run ignored that recovery, saw the old failed CI, returned release-held, and wrote a new ci-failed pause.
  • The repair could never merge, because merging it was the only way to produce a new passing main CI run.

The root cause is that the original CI failure lingers on the failed SHA until the repair merges. Any coordinator run (especially a workflow pinned to an older DeployBot release) rereads that same result, writes a fresh, byte-identical pause, and overwrites the running recovery. The release-admission fence also keeps holding that failed SHA, so the repair never drains.
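The overwrite can be illustrated with a minimal sketch. It assumes a simple append-only control log in which the newest record wins; the `Control` type, the `naive_latest` helper, and the placeholder SHA string are hypothetical and are not taken from the DeployBot source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Hypothetical control record; field names are assumptions."""
    kind: str    # "pause" or "recovery"
    sha: str     # the main revision the record refers to
    reason: str  # e.g. "ci-failed"


def naive_latest(log):
    """Assumed pre-fix behavior: the newest record simply wins."""
    return log[-1] if log else None


FAILED = "failed-main-sha"  # placeholder, not a real revision

log = [
    Control("pause", FAILED, "ci-failed"),     # original CI failure
    Control("recovery", FAILED, "ci-failed"),  # operator runs unpause
    Control("pause", FAILED, "ci-failed"),     # stale run rereads the old CI result
]

# The stale pause is identical to the original one, yet it replaces the recovery.
state = naive_latest(log)
```

Under these assumptions `state` ends up as a pause again, even though nothing new has failed since the operator unpaused.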

Fix

Three coupled changes break the deadlock and make recovery robust against stale reconciliation:

  1. Preserve the recovery (records reconciliation). A recovery now remembers the exact SHA and reason it resumed. An unconditional pause that merely restates that already-recovered failure is ignored, so a lagging or older-version worker can no longer clobber the recovery. A genuinely new failure (a different SHA, or a different reason such as a later deploy failure) still pauses normally, preserving concurrent-pause ownership.

  2. Merge the repair (reactor admission fence). While a recovery owns the current failed main, the release-admission fence no longer holds that SHA. The elected repair drains and advances main past the failed revision, and the new main is then followed and verified normally.

  3. Wake immediately (unpause). After recording the durable recovery, DeployBot reacts right away so the repair merges without waiting for the next delivery event or the five-minute reconciliation sweep. --no-wake opts out, and --follow/--dispatch-ci/--timeout shape that wake-up reaction. The recovery is already durable, so a transient wake-up error is reported but never re-pauses the pipeline.
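The first two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the `Control` record, `reconcile_controls`, `fence_holds`, and the string labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Control:
    """Hypothetical control record; field names are assumptions."""
    kind: str    # "pause" or "recovery"
    sha: str     # revision the record refers to
    reason: str  # e.g. "ci-failed" or "deploy-failed"


def reconcile_controls(log: Iterable[Control]) -> Optional[Control]:
    """Return the effective control, skipping pauses that only restate
    the failure a running recovery has already resumed."""
    current: Optional[Control] = None
    for rec in log:
        restates_recovered_failure = (
            rec.kind == "pause"
            and current is not None
            and current.kind == "recovery"
            and (rec.sha, rec.reason) == (current.sha, current.reason)
        )
        if restates_recovered_failure:
            continue  # stale re-pause: keep the recovery
        current = rec  # a new SHA or a new reason still takes effect
    return current


def fence_holds(main_sha: str, main_ci_failed: bool,
                control: Optional[Control]) -> bool:
    """Assumed admission fence: hold a failed main unless a recovery
    owns exactly that SHA."""
    if not main_ci_failed:
        return False
    recovery_owns_main = (
        control is not None
        and control.kind == "recovery"
        and control.sha == main_sha
    )
    return not recovery_owns_main
```

In this sketch the bypass is keyed on the exact recovered SHA, so a failure on any other revision is still held, and a pause with a different reason on the same SHA still replaces the recovery.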

Tests

New regressions cover the stale re-pause race, a genuinely new failure still pausing, the reactor merging a repair during recovery (and the bypass staying scoped to the exact unpaused SHA), and the unpause wake (including opt-out and durable-recovery-survives-wake-failure).
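As a rough idea of what the unpause-wake regressions might look like, here is a self-contained sketch. The `unpause` function below is a hypothetical stand-in written for this example (persist the recovery first, then wake on a best-effort basis); it is not DeployBot's real `command_unpause`, and the return strings are invented.

```python
import unittest


def unpause(store, sha, reason, wake=None):
    """Hypothetical stand-in: record the recovery durably, then try to wake."""
    store.append(("recovery", sha, reason))  # durable before any wake attempt
    if wake is None:                         # models the --no-wake opt-out
        return "recovered (no wake)"
    try:
        wake()
    except Exception as exc:                 # report the error, never re-pause
        return f"recovered (wake failed: {exc})"
    return "recovered (woke)"


class UnpauseWakeTest(unittest.TestCase):
    def test_recovery_survives_wake_failure(self):
        store = []

        def failing_wake():
            raise RuntimeError("transient")

        status = unpause(store, "sha-a", "ci-failed", wake=failing_wake)
        self.assertEqual(store, [("recovery", "sha-a", "ci-failed")])
        self.assertIn("wake failed", status)

    def test_no_wake_opt_out(self):
        store = []
        status = unpause(store, "sha-a", "ci-failed")
        self.assertEqual(len(store), 1)
        self.assertEqual(status, "recovered (no wake)")

    def test_wake_runs_after_recovery_is_recorded(self):
        store, seen = [], []
        unpause(store, "sha-a", "ci-failed", wake=lambda: seen.append(len(store)))
        self.assertEqual(seen, [1])  # the recovery was already stored when wake ran
```

Saved as a module under `tests`, a file like this would be picked up by `python -m unittest discover -s tests`.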

All existing behavior is preserved. CI parity verified locally: ruff check src tests, python -m unittest discover -s tests (251 tests OK), and python -m build all pass.

Note on versioning

The runtime pin (RELEASE_COMMIT / vX.Y.Z references in README, action, and client configs) points at a published release commit, so it is intentionally left to the separate release-pin step once this fix has a merge commit. The code fix itself is version-independent and robust even during a rolling upgrade.


A failed exact-main release that an operator explicitly unpaused could
never land its repair. The original CI failure lingers on the failed SHA
until the repair merges, so any coordinator run (especially a workflow
pinned to an older release) rereads that result, writes a fresh,
byte-identical pause, and overwrites the running recovery. The repair then
sits behind a release that can only turn green once the repair merges.

Three coupled fixes break the deadlock:

- records.latest_control: a recovery now carries the exact SHA and reason
  it resumed, and an unconditional pause that merely restates that same
  already-recovered failure is ignored. A genuinely new failure (a
  different SHA, or a different reason such as a later deploy failure)
  still pauses normally, so concurrent-pause ownership is preserved.

- command_react: while a recovery owns the current failed main, the
  release-admission fence no longer holds that SHA. The elected repair
  drains and advances main past the failed revision; the new main is then
  followed and verified normally.

- command_unpause: after recording the durable recovery, DeployBot reacts
  immediately so the repair merges without waiting for the next delivery
  event or the five-minute reconciliation sweep. --no-wake opts out, and
  --follow/--dispatch-ci/--timeout shape the wake-up reaction. The
  recovery is durable, so a transient wake-up error is reported but never
  re-pauses the pipeline.

Adds regressions covering the stale re-pause race, the reactor merging a
repair during recovery, the scoping of that bypass, and the unpause wake.

Co-authored-by: mberman84 <mberman84@users.noreply.github.com>
cursor[bot] merged commit d615630 into main on Jun 22, 2026
3 checks passed